Harnessing the CRF Complexity with Domain-Specific Constraints. The Case of Morphosyntactic Tagging of a Highly Inflected Language

Author

  • Jakub Waszczuk
Abstract

We describe a domain-specific method of adapting conditional random fields (CRFs) to morphosyntactic tagging of highly inflected languages. The solution extends CRFs with additional, position-wise restrictions on the output domain. These restrictions enforce consistency between the modeled label sequences and the results of morphosyntactic analysis, both during decoding and, more importantly, during parameter estimation. We decompose the problem of morphosyntactic disambiguation into two consecutive stages: context-sensitive morphosyntactic guessing, followed by disambiguation proper. This division helps in designing well-adjusted, CRF-based methods for both tasks, which together constitute Concraft, a highly accurate tagging system for the Polish language available under the 2-clause BSD license. Evaluation on the National Corpus of Polish shows that our solution significantly outperforms other state-of-the-art taggers for Polish (Pantera, WMBT and WCRFT), especially in terms of accuracy on unknown words.
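To make the idea of position-wise output restrictions concrete, here is a minimal sketch of a first-order linear-chain model in which each position may only take labels proposed by a morphosyntactic analyzer. It is an illustration under stated assumptions, not Concraft's implementation: the function names (`constrained_viterbi`, `constrained_log_partition`), the dictionary-based score tables, and the example tags are all hypothetical. Restricting the Viterbi search corresponds to the decoding-time constraint, while restricting the forward sums (the normalizer) corresponds to applying the same constraint during parameter estimation. Every allowed set is assumed to be non-empty.

```python
import math
from typing import Dict, List, Sequence, Set, Tuple

Emission = Sequence[Dict[str, float]]        # per-position label scores (hypothetical layout)
Transition = Dict[Tuple[str, str], float]    # (previous label, label) -> score


def constrained_viterbi(emission: Emission, transition: Transition,
                        allowed: Sequence[Set[str]]) -> List[str]:
    """Best label sequence with labels[i] restricted to allowed[i]."""
    n = len(allowed)
    if n == 0:
        return []
    # best[i][y]: best score of any allowed path ending in label y at position i
    best = [{y: emission[0].get(y, 0.0) for y in sorted(allowed[0])}]
    back: List[Dict[str, str]] = [{}]
    for i in range(1, n):
        scores, pointers = {}, {}
        for y in sorted(allowed[i]):
            # Only labels allowed at position i - 1 are considered as predecessors.
            prev, score = max(
                ((p, s + transition.get((p, y), 0.0)) for p, s in best[i - 1].items()),
                key=lambda pair: pair[1],
            )
            scores[y] = score + emission[i].get(y, 0.0)
            pointers[y] = prev
        best.append(scores)
        back.append(pointers)
    last = max(best[-1], key=best[-1].get)
    path = [last]
    for i in range(n - 1, 0, -1):
        last = back[i][last]
        path.append(last)
    path.reverse()
    return path


def _logsumexp(values: List[float]) -> float:
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def constrained_log_partition(emission: Emission, transition: Transition,
                              allowed: Sequence[Set[str]]) -> float:
    """Log of the normalizer, summing only over sequences that respect `allowed`."""
    n = len(allowed)
    if n == 0:
        return 0.0
    alpha = {y: emission[0].get(y, 0.0) for y in allowed[0]}
    for i in range(1, n):
        alpha = {
            y: emission[i].get(y, 0.0)
            + _logsumexp([a + transition.get((p, y), 0.0) for p, a in alpha.items()])
            for y in allowed[i]
        }
    return _logsumexp(list(alpha.values()))
```

Because the normalizer ranges only over analyzer-consistent sequences, probability mass is never assigned to tags the analyzer rules out, and the per-position cost depends on the size of the allowed sets rather than on the full tagset.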

Similar resources

Arabic Morphosyntactic Raw Text Part of Speech Tagging System

Introduction and Overview: The topic of this dissertation is morphosyntactic part-of-speech tagging (abbreviated POS tagging) for Arabic. This topic has a long and rich history for other languages, mainly English. POS tagging provides fundamental information about the word forms used in natural-language sentences. The method of utilizing this information varies depending on the particular NLP ...

Identification of High-Frequency Morphosyntactic Structures in Persian-Speaking Children Aged 4-6 Years: A Qualitative Research

Background: Syntax is highly important among linguistic parameters, and the prevalence of syntax deficits is relatively high in children with language disorders. As such, independent examination of syntax in language development is of paramount importance. In this regard, Iranian language pathologists face a lack of standardized tests. The present study aimed to determine the most ...

A Study of Inflectional Categories of Noun in Sistani Dialect

The present article aims to provide a synchronic study of the inflectional, or morphosyntactic, categories of the noun in the Sistani dialect. These categories comprise person, number, gender or noun class, definiteness, case, and possession. Linguistic data were collected by recording free speech and interviewing 30 (15 female, 15 male) illiterate Sistani language consultants of age 40–102 year...

Turkish PoS Tagging by Reducing Sparsity with Morpheme Tags in Small Datasets

Sparsity is one of the major problems in natural language processing. The problem becomes even more severe in agglutinative languages, which are heavily inflected. We deal with sparsity in Turkish by adopting morphological features for part-of-speech tagging. We learn inflectional and derivational morpheme tags in Turkish by using conditional random fields (CRF) and we employ the morph...

Language Sample Analysis of Children With Cleft Lip And Palate: A Comparative Study

Background: Cleft palate (CP) with or without cleft lip (CL/P) is the most common craniofacial birth defect. Cleft lip and palate (CLP) can affect children's communication skills. The present study aimed to evaluate the morphosyntactic (morphology and syntax) language production skills of children with CLP. Method: In the current cross-sectional study, 58 Persian-language child...

Publication date: 2012